ABSTRACT

We study dynamical mass measurements of galaxy clusters contaminated by interlopers and show that a modern machine learning (ML) algorithm can reduce mass errors by better than a factor of two compared to a standard scaling relation approach. We create two mock catalogs from Multidark's publicly available $N$-body MDPL1 simulation, one with perfect galaxy cluster membership information and the other where a simple cylindrical cut around the cluster center allows interlopers to contaminate the clusters. In the standard approach, we use a power-law scaling relation to infer cluster mass from galaxy line-of-sight (LOS) velocity dispersion. Even with perfect membership knowledge, an unrealistic best case, this approach produces a wide fractional mass error distribution, with a width of $\Delta\epsilon \approx 0.87$. Interlopers introduce additional scatter, significantly widening the error distribution further ($\Delta\epsilon \approx 2.13$). We employ the support distribution machine (SDM) class of algorithms to learn from distributions of data to predict single values. Applied to distributions of galaxy observables such as LOS velocity and projected distance from the cluster center, SDM yields better than a factor-of-two improvement ($\Delta\epsilon \approx 0.67$) for the contaminated case. Remarkably, SDM applied to contaminated clusters is better able to recover masses than even the scaling relation approach applied to uncontaminated clusters. We show that the SDM method more accurately reproduces the cluster mass function, making it a valuable tool for employing cluster observations to evaluate cosmological models.

Subject headings: cosmology: theory—dark matter—galaxies: clusters: general—galaxies: kinematics 
and dynamics—gravitation—large-scale structure of universe—methods: statistical 


1. INTRODUCTION 

Galaxy clusters are the most massive gravitationally bound systems in the Universe. They are dark matter dominated, and have halos of mass $> 10^{14}\,M_\odot\,h^{-1}$. The majority of multiple-wavelength observations do not directly probe the dark matter distribution, but rather the baryonic component of clusters: the hot gas and tens to thousands of galaxies contained within the halo. Clusters have complex substructure and internal dynamics, and grow through hierarchical merging and the accretion of matter from the cosmic web. Cluster abundance as a function of mass and redshift is sensitive to the underlying dark matter and dark energy content of the Universe and can be used to test cosmological models. See Voit (2005) and Allen et al. (2011) for a review.


to constrain cosmological parameters (e.g. Schuecker et al. 2003; Henry et al. 2009; Vikhlinin et al. 2009; Rozo et al. 2010; Mantz et al. 2010; Vanderlinde et al. 2010; Sehgal et al. 2011; Allen et al. 2011; Planck Collaboration et al. 2014b; Mantz et al. 2015), capitalizing on clusters

sample of cluster observations, a connection linking the observations of the baryonic component to the underlying dark matter, and a good understanding of the intrinsic scatter in the mass-observable relationship. A variety of methods connecting observables to cluster mass exist, utilizing observations across multiple wavelengths. A subset of these techniques, broadly labeled dynamical mass measurements, is based on measurements of galaxy kinematics. Dynamical mass measurements utilize line-of-sight (LOS) velocities of the galaxies within the virial radius of the cluster, and may also take advantage of the unvirialized matter falling toward the cluster.

The virial theorem approach considers cluster members' LOS velocity dispersion, $\sigma_v$. This method scales halo mass, $M$, with $\sigma_v$ as a power law and famously led to Zwicky's (1933) discovery of dark matter in the Coma cluster. Dynamical mass measurements based on the virial theorem continue to be used to determine cluster masses (e.g. Brodwin et al. 2010; Rines et al. 2010; Sifón et al. 2013; Ruel et al. 2014; Bocquet et al. 2015). Old et al. (2014) and Old et al. (2015) provide a com-


galaxy observables. Even when cluster membership is perfectly and fully known, there is scatter in the $M(\sigma_v)$ scaling relation. This can be attributed to both physical effects and selection effects, including halo environment and triaxiality (e.g. White et al. 2010; Saro et al. 2013; Wojtak 2013; Svensmark et al. 2014), projection effects (e.g. Cohn 2012; Noh & Cohn 2012), mass-dependent tidal disruption (e.g. Munari et al. 2013), the degree of relaxedness of the cluster (e.g. Evrard et al. 2008; Ribeiro et al. 2011), and galaxy selection strategy (e.g. Old et al. 2013; Saro et al. 2013; Wu et al. 2013). Halos undergoing mergers or matter accretion possess a telltale wide, flat velocity probability distribution function (PDF) (Ribeiro et al. 2011). Impure, incomplete cluster membership catalogs increase scatter in the $M(\sigma_v)$ relationship further. Reducing errors in cluster mass measurements is essential for applying clusters as cosmological probes.

The galaxy dynamics beyond the virial radius of the cluster is likewise informative, and nearby, unvirialized matter can also be used for cluster mass measurements. The caustic technique employs infalling matter and galaxy velocities to determine a mass profile (e.g. Biviano & Girardi 2003; Serra et al. 2011; Gifford & Miller 2013) and can be applied to determine cluster masses (e.g. Rines & Diaferio 2006; Rines et al. 2013; Geller et al. 2013), performing well even in the case of merging halos (e.g. Rines et al. 2003). Further, the nonvirialized infalling matter beyond the virial radius provides cues which can be used to infer a cluster's mass (e.g. Zu & Weinberg 2013; Falco et al. 2014).


A machine learning approach to dynamical mass measurements was explored in Ntampaka et al. (2015). There, we built on the virial theorem's simple $M(\sigma_v)$ power law to take advantage of the entire LOS velocity PDF for mock observations with pure and complete cluster membership information, using all relevant substructure within the $R_{200c}$ of each cluster. Taking full advantage of the velocity PDF was achieved by applying a nonparametric machine learning (ML) approach to a PDF of LOS velocities from a mock cluster catalog. By employing support distribution machines (SDMs), a class of ML algorithms that learns from a distribution to predict a scalar, the full velocity PDF was used to improve mass predictions. A traditional power-law scaling relation yielded a wide fractional mass error distribution (see Equation 3) and extended high-error tails. SDMs trained on LOS velocities resulted in almost a factor-of-two reduction in mass errors compared to the traditional approach, substantially reducing the number of severely over- and underestimated halo masses in the ideal case with pure and complete cluster membership information.

However, the idealized catalog used in this case did not account for a primary source of error in dynamical mass measurements: interloper galaxies in the fore- or background of the true cluster, appearing to be cluster members. In an ideal cluster catalog, all cluster members are known (complete) and the observations contain only true members (pure). Cluster observations that are impure due to contamination by interlopers are subject to additional scatter in the $M(\sigma_v)$ relationship (e.g. Mamon et al. 2010), and a variety of methods have been developed to remove interloper galaxies from the sample (e.g. Fadda et al. 1996; von der Linden et al. 2007; Mamon et al. 2013; Pearson et al. 2015).

In this follow-up paper, we explore how a more realistically prepared mock catalog influences both the $M(\sigma_v)$ scaling relation and the SDM predictions of cluster mass. Cluster members are selected within a cylinder defined by a projected radius in the plane of the sky and a radial velocity along the line of sight. This technique produces spectroscopic member catalogs that are impure, containing interloping galaxies that appear to be cluster members but do not reside within the virial radius of the cluster. They are also incomplete, excluding some true cluster members from the sample.

In Sec. 2 we discuss our methods: the simulation (2.1), mock observation (2.2), power-law scaling relation (2.3), and SDM implementation (2.4). Results are presented in Sec. 3 and discussed in Sec. 4. We present a summary of our findings in Sec. 5. Finally, we explore how changes to our mock catalog affect power law and ML results in the Appendix.

2. METHODS 
2.1. Simulation 

The mock cluster catalog is created from the publicly available Multidark MDPL1 simulation [3]. Multidark is an $N$-body simulation containing $3840^3$ particles in a box of length $1\,h^{-1}\,{\rm Gpc}$ and a mass resolution of $1.51 \times 10^{9}\,M_\odot\,h^{-1}$. Multidark was run using the L-Gadget2 code. It utilizes a $\Lambda$CDM cosmology, with cosmological parameters consistent with Planck data (Planck Collaboration et al. 2014a): $\Omega_\Lambda = 0.69$, $\Omega_m = 0.31$, $\Omega_b = 0.048$, $h = 0.68$, $n = 0.96$, and $\sigma_8 = 0.82$.

Halos are identified by Multidark's BDMW algorithm, which uses a bound density maximum (BDM) spherical overdensity halo finder with halo average density equal to 200 times the critical density of the Universe; the resulting halo mass is denoted $M$. All halos and subhalos at redshift $z = 0$ with mass $M > 10^{12}\,M_\odot\,h^{-1}$ are included in our sample. For more information on the Multidark simulation and BDMW halo finder, see Klypin & Holtzman (1997); Riebe et al. (2013); Klypin et al. (2014) and references therein.


2.2. Mock Observations 

Two mock observations are created: Pure and Contaminated. For each of these two mock observations, a train sample and a test sample are made. The Pure Catalog is ideal, in that all cluster members above $M_{\rm sub} = 10^{12}\,M_\odot\,h^{-1}$ within $R_{200}$ are included in the catalog. The train catalog has a flat mass function, with 5028 unique halos with $M > 10^{14}\,M_\odot\,h^{-1}$. Halos in this catalog contribute multiple lines of sight each, such that low- and high-mass clusters are represented in equal measure. The test catalog has 2278 unique halos with a lower mass cut of $M > 3 \times 10^{14}\,M_\odot\,h^{-1}$, and each unique halo contributes exactly three lines of sight. The Pure Catalog is discussed in further detail in Ntampaka et al. (2015).

In contrast with the Pure Catalog, the Contaminated Catalog includes more realistic observational selection effects. It employs a simple, cylindrical cut around each cluster, allowing interlopers to contaminate the sample. As with the Pure Catalog, the Contaminated Catalog has both a train catalog with a flat mass function and a test catalog that uses three lines of sight per cluster.

The Contaminated Catalog is constructed in the following way: each halo and subhalo is assumed to represent an observable galaxy, with the galaxy inheriting its host's position and velocity. A simple cut is made around each cluster, allowing for interlopers to contaminate the cluster observation. To allow for interlopers across the box edge, the entire simulation box is padded with a $200\,{\rm Mpc}\,h^{-1}$-thick slice from across the periodic boundary to make a cube with length $1.4\,{\rm Gpc}\,h^{-1}$. This cubic mock observation is used to create a mock cluster catalog that incorporates known observational selection effects.

[3] http://www.cosmosim.org/
TABLE 1
Catalog Summary

Catalog Name  Type            Min. Halo Mass        $R_{\rm aperture}$  $v_{\rm cut}$  $\sigma_{\rm cut}$  Projections per  Total        $\sigma_{15}$  $\alpha$
                              ($M_\odot\,h^{-1}$)   (Mpc $h^{-1}$)      (km s$^{-1}$)                      Unique Halo      Projections  (km s$^{-1}$)
Pure          Train           $1 \times 10^{14}$    -                   -              -                   varies           15000        1244           0.382
Pure          Test            $3 \times 10^{14}$    -                   -              -                   3                6834         -              -
Pure          High Mass Test  $7 \times 10^{14}$    -                   -              -                   3                945          -              -
Contaminated  ML Train        $1 \times 10^{14}$    1.6                 2500           2.0                 varies           15000        -              -
Contaminated  PL Train        $3 \times 10^{14}$    1.6                 2500           2.0                 varies           10213        753            0.359
Contaminated  Test            $3 \times 10^{14}$    1.6                 2500           2.0                 3                7449         -              -
Contaminated  High Mass Test  $7 \times 10^{14}$    1.6                 2500           2.0                 3                951          -              -

Note. — For the Pure Catalogs, cluster radius and member galaxies are known. For further details on the creation of this catalog, see Ntampaka et al. (2015).


An intentionally simplistic cylindrical cut is made around each cluster center. Only halos with $M > 10^{14}\,M_\odot\,h^{-1}$ with centers that reside within the original $1\,{\rm Gpc}\,h^{-1}$ box volume are considered to be “cluster candidates.” Following Old et al. (2014), true cluster centers are assumed to be known by the observer. Following Wojtak et al. (2007), the observer is placed 100 Mpc from the center of the cluster along the chosen line of sight.

The full 3D galaxy velocity and position information is then reduced to what can be observed along this line of sight: plane-of-sky $x'$- and $y'$-positions and LOS velocities. A galaxy's net velocity, $v$, is given by the sum of the peculiar velocity and the Hubble flow. An initial cylindrical cut is defined by a circular aperture with radius $R_{\rm aperture}$ about the cluster center in the plane of the sky and an initial LOS velocity cut of $v_{\rm cut}$ about the expected Hubble flow velocity of an object located at a distance of 100 Mpc from the observer.
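
As an illustration only, the initial cylinder selection can be sketched in a few lines of Python. This is not the authors' pipeline: the function name `cylinder_cut`, the tuple layout of the galaxy list, and the default Hubble-flow velocity of 6800 km/s (100 Mpc at $h = 0.68$) are assumptions made for the example, while the aperture and velocity cut defaults are the Table 1 values.

```python
import math


def cylinder_cut(galaxies, r_aperture=1.6, v_cut=2500.0, v_hubble=6800.0):
    """Select galaxies inside a cylinder around a cluster center at the origin.

    galaxies   : iterable of (x, y, v_los) tuples, where x and y are
                 plane-of-sky offsets from the cluster center (Mpc/h) and
                 v_los is the LOS velocity in km/s (peculiar + Hubble flow).
    r_aperture : aperture radius in Mpc/h (Table 1 value).
    v_cut      : half-width of the LOS velocity window in km/s (Table 1 value).
    v_hubble   : ASSUMED Hubble-flow velocity of the cluster center in km/s.
    """
    selected = []
    for x, y, v_los in galaxies:
        inside_aperture = math.hypot(x, y) <= r_aperture
        inside_velocity = abs(v_los - v_hubble) <= v_cut
        if inside_aperture and inside_velocity:
            selected.append((x, y, v_los))
    return selected
```

Anything that passes both tests is kept, whether or not it is a true cluster member, which is how interlopers enter the sample.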

The cylinder $R_{\rm aperture}$ and $v_{\rm cut}$ values are chosen to correspond with the radius and $2\sigma_v$, respectively, of a $1 \times 10^{15}\,M_\odot\,h^{-1}$ cluster. The radius of a cluster of this mass is $1.6\,{\rm Mpc}\,h^{-1}$. The $2\sigma_v$ value is informed by the best-fit power law found in Ntampaka et al. (2015), giving twice a typical velocity dispersion of true cluster members of $2\sigma_v \approx 2500\,{\rm km\,s^{-1}}$ for a cluster of mass $1 \times 10^{15}\,M_\odot\,h^{-1}$. These parameters are noted in Table 1. A more thorough exploration of how $R_{\rm aperture}$ and $v_{\rm cut}$ choices affect cluster mass predictions is presented in the Appendix.

This initial cylinder is pared iteratively in velocity space, with outliers beyond $2\sigma_v$ of the mean velocity being omitted from the sample. Here, $\sigma_v$ denotes the standard deviation of all LOS velocities of the galaxies that reside in the cylinder. This paring continues until convergence is reached or until fewer than 20 members remain. Clusters with at least 20 members remaining are added to the cluster catalog.
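
A minimal sketch of this iterative paring, assuming a population standard deviation and a hypothetical function name `pare_velocities` (neither is specified in the text), might look like:

```python
import statistics


def pare_velocities(v_los, n_sigma=2.0, min_members=20):
    """Iteratively drop galaxies beyond n_sigma of the mean LOS velocity.

    Returns the surviving velocities once no further galaxy is removed,
    or None if fewer than min_members remain at any point.
    """
    members = list(v_los)
    while len(members) >= min_members:
        mean = statistics.fmean(members)
        sigma = statistics.pstdev(members)  # assumption: population std
        kept = [v for v in members if abs(v - mean) <= n_sigma * sigma]
        if len(kept) == len(members):  # converged: nothing was removed
            return members
        members = kept
    return None  # cluster is dropped from the catalog
```

Each pass recomputes the mean and $\sigma_v$ from the survivors, so a single extreme interloper that inflates the first dispersion estimate is removed before the tighter second pass.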

In order to create a training sample representative of how the rare, high-mass clusters might appear when viewed from any direction, the entire box is rotated and this process is repeated. The first three rotations are chosen so that the observer views along the box $x$-, $y$-, and $z$-directions. The remaining rotations are chosen randomly on the surface of the unit sphere. To create the Contaminated Train Catalog, 1000 such rotations are performed.

The Train Catalog includes halos with $M > 1 \times 10^{14}\,M_\odot\,h^{-1}$. It is created with a flat mass function, such that there are exactly 1000 training clusters in each 0.1 dex mass bin. In bins with fewer than 1000 clusters, this is done by assembling many LOS views of rare halos. In mass bins with more than 1000 clusters, clusters are rank-ordered by mass and evenly removed from the training sample.
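
One way to picture the flat-mass-function construction is the sketch below. It is an interpretation under stated assumptions, not the authors' code: the function name `flat_mass_function`, the `(halo_id, view_index)` output format, and the specific even-thinning rule are all hypothetical choices.

```python
import math
from collections import defaultdict


def flat_mass_function(halos, n_per_bin=1000, log_m_min=14.0, bin_width=0.1):
    """Build a training sample with n_per_bin entries in every mass bin.

    halos : list of (halo_id, mass) pairs, mass in Msun/h.
    Returns {bin_index: [(halo_id, view_index), ...]}.
    """
    bins = defaultdict(list)
    for halo_id, mass in halos:
        log_m = math.log10(mass)
        if log_m < log_m_min:
            continue  # below the Train Catalog mass cut
        bins[int((log_m - log_m_min) / bin_width)].append((mass, halo_id))

    sample = {}
    for index, members in bins.items():
        members.sort()  # rank-order by mass
        n = len(members)
        if n >= n_per_bin:
            # Crowded bin: keep evenly spaced ranks, one LOS view each.
            picks = [members[(i * n) // n_per_bin][1] for i in range(n_per_bin)]
            sample[index] = [(halo_id, 0) for halo_id in picks]
        else:
            # Sparse bin: cycle through the halos, using a new view per pass.
            sample[index] = [(members[i % n][1], i // n) for i in range(n_per_bin)]
    return sample
```

With the default of 1000 entries per bin, a bin holding a single halo needs view indices 0 through 999, which matches the 1000 rotations described above.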

In contrast with the Contaminated Train Catalog, the Contaminated Test Catalog contains exactly three LOS views of every halo: the box $x$-, $y$-, and $z$-directions. Because boundary effects are expected near the edge of the training sample, a minimum mass cut of $M > 3 \times 10^{14}\,M_\odot\,h^{-1}$ is applied to the test catalogs. The single most massive halo has a mass that will necessarily lie outside of the training sample, and therefore is omitted from the test catalogs as well.

In summary, the Contaminated Catalog is created in the following manner:

1. All halos and subhalos with mass greater than $10^{12}\,M_\odot\,h^{-1}$ are assumed to represent a galaxy, with the galaxy inheriting its host's position and velocity.

2. Halos with mass greater than $10^{14}\,M_\odot\,h^{-1}$ are considered “cluster candidates.”

3. A cluster candidate's center is assumed to be known, and an observer is placed 100 Mpc from the cluster.

4. All galaxies in the box are given an appropriate velocity that includes both Hubble flow and peculiar velocities.

5. A cylinder is cut around the cluster candidate center; this cylinder is defined by an aperture radius, $R_{\rm aperture}$, and a LOS velocity cut, $v_{\rm cut}$.

6. Galaxies outside of mean galaxy velocity $\pm 2\sigma_v$ are iteratively removed from this cylinder until convergence is reached.

7. This is repeated for all massive halos in the box, and those with at least 20 members remaining are kept in the sample.

8. The box is rotated, and steps 3-7 are repeated.

9. The Contaminated Train Catalog is made of multiple LOS projections, up to 1000 for the highest-mass cluster. The number of projections per unique halo is chosen to create a flat mass function for the Train Catalog.

10. The Contaminated Test Catalog is made of the first three ($x$-, $y$-, and $z$-directions) views of all halos above $M = 3 \times 10^{14}\,M_\odot\,h^{-1}$. The most massive halo is also excluded from the Test Catalog.

Fig. 1.— Top: Average distribution of galaxy LOS velocities from stacked clusters in three $\log[M\,(M_\odot\,h^{-1})]$ bins, in increasing mass from left to right. While the Pure Catalog (green dashed) consists solely of galaxies residing within the virial radius of the cluster, the Contaminated Catalog (blue solid) contains contaminating interlopers (red dotted) and excludes some true cluster members. In the top right panel, the exclusion of true cluster members is evident where the blue solid line dips below green dashed. Bottom: Average distribution of galaxy projected radii from the cluster center. Both $v_{\rm los}$ and $R$ distributions change shape and amplitude with cluster mass, even for the Contaminated Catalog; this mass-dependent shape can be exploited by a distribution-to-scalar ML technique to learn cluster masses from distributions of data like the examples shown here.

Figure 1 shows the average $v_{\rm los}$ and $R$ distributions for the Train Catalogs, divided into three $\log[M\,(M_\odot\,h^{-1})]$ bins. The Pure Catalog is pure, in that there are no interlopers contaminating the galaxy clusters. It is also complete, in that all galaxies within the cluster $R_{200}$ are known. In contrast, the Contaminated Catalog includes interlopers and excludes some true cluster members. The shapes of the $v_{\rm los}$ and $R$ distributions are mass-dependent, and this dependence on cluster mass can be utilized in mass predictions. In Sec. 2.4 we will explore ways to predict cluster mass by exploiting these mass-dependent distributions using a distribution-to-scalar machine learning technique.


2.3. Power Law 

In a typical power-law scaling relation, one starts with the virial theorem to find a relationship between the velocity dispersion, $\sigma_v$, and halo mass, $M$. This power law is given as $\sigma_v \propto M^{1/3}$ but can be rewritten more generally as

$$\sigma_v(M) = \sigma_{15} \left( \frac{M}{10^{15}\,M_\odot\,h^{-1}} \right)^{\alpha}, \qquad (1)$$

where $\sigma_{15}$ is the typical velocity dispersion of galaxies residing within a $10^{15}\,M_\odot\,h^{-1}$ halo and the parameter $\alpha$ is allowed to vary from the theoretically predicted $\alpha = 1/3$ and is instead fit to data. The best fit is then used to predict cluster mass from the velocity dispersion of its galaxies. When applied to the Pure Catalog, this method will be denoted PLp, and when applied to the Contaminated Catalog, it will be denoted PLc.

To account for a potentially changing slope caused by the cylindrical cut used for the Contaminated Catalog, a lower mass cut of $3 \times 10^{14}\,M_\odot\,h^{-1}$ is applied to the data used to fit the power law. We find a least-squares fit to $\log(\sigma_v) = \alpha \log(M) + \beta$ for the PL Train Catalog.

Fig. 2.— Velocity dispersion, $\sigma_v$, vs. cluster mass, $M$, for a simple cylindrical cut with iterative 2-$\sigma$ paring. Clusters above $3 \times 10^{14}\,M_\odot\,h^{-1}$ (vertical black dash-dotted) inform the fit (black solid) and determine the lognormal scatter (68% and 95%, dashed and dotted, respectively). The presence of interlopers introduces significant scatter, particularly at low masses, where the effect of interlopers is more pronounced.

While PLp is well described by $\alpha = 0.382$ and $\sigma_{15} = 1244\,{\rm km\,s^{-1}}$, PLc has a shallower slope and a smaller velocity dispersion expected for a $10^{15}\,M_\odot\,h^{-1}$ halo, $\alpha = 0.359$ and $\sigma_{15} = 753\,{\rm km\,s^{-1}}$, respectively. These best-fit parameters to the $M(\sigma_v)$ power law (Equation 1) for each catalog are noted in Table 1. Because the scaling relation best fit for the Contaminated Catalog is shallower and has a smaller $\sigma_{15}$ compared to that of the Pure Catalog, applying the PLp fit to observed clusters with interlopers can introduce additional error. We additionally caution that these parameters are a fit for a particular simulation and cylindrical cut and should be applied to observational data with care.
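
The fit and its inversion can be sketched as follows, assuming the power law takes the form $\sigma_v = \sigma_{15}\,(M / 10^{15}\,M_\odot\,h^{-1})^{\alpha}$. The helper names `fit_power_law` and `predict_mass` are hypothetical, and fitting against $\log(M/10^{15})$ rather than $\log(M)$ is a convenience chosen here so that the intercept is $\log(\sigma_{15})$ directly (equivalently, $\beta = \log\sigma_{15} - 15\alpha$).

```python
import math


def fit_power_law(masses, sigmas, pivot_mass=1e15):
    """Ordinary least-squares fit of log10(sigma) vs. log10(M / pivot).

    Returns (alpha, sigma_15): the slope and the dispersion at the pivot mass.
    """
    x = [math.log10(m / pivot_mass) for m in masses]
    y = [math.log10(s) for s in sigmas]
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    alpha = sxy / sxx
    sigma_15 = 10 ** (y_mean - alpha * x_mean)
    return alpha, sigma_15


def predict_mass(sigma_v, alpha, sigma_15, pivot_mass=1e15):
    """Invert the power law: M = pivot * (sigma_v / sigma_15)^(1/alpha)."""
    return pivot_mass * (sigma_v / sigma_15) ** (1.0 / alpha)
```

Because the inversion raises $\sigma_v/\sigma_{15}$ to the power $1/\alpha \approx 3$, a modest overestimate of the velocity dispersion, such as one caused by interlopers, turns into a much larger overestimate of the mass.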

The introduction of interlopers is a large source of scatter in $M(\sigma_v)$. Figure 2 shows a two-dimensional histogram of $\sigma_v$ vs. $M$ for the Contaminated Catalog. Overlaid is a best fit with 1- and 2-$\sigma$ lognormal errors calculated for clusters with mass above $3 \times 10^{14}\,M_\odot\,h^{-1}$ and extrapolated down to lower masses. This lognormal scatter, $\sigma_{\rm gauss}$, is determined by the standard deviation of the residual, $\delta$, defined as

$$\delta = \log(\sigma_{\rm measured}) - \log(\sigma_{\rm expected}). \qquad (2)$$

Here, $\sigma_{\rm measured}$ is the velocity dispersion of the galaxies within the pared cylinder and $\sigma_{\rm expected}$ is the typical velocity dispersion expected for a cluster of a given mass, found by applying Equation 1 with true cluster mass $M$ and best-fit parameters $\sigma_{15}$ and $\alpha$. Of halos with $M > 3 \times 10^{14}\,M_\odot\,h^{-1}$, 1% reside above the $+2\sigma$ dotted line and 4% reside below the $-2\sigma$ dotted line. However, of halos with $1 \times 10^{14}\,M_\odot\,h^{-1} < M < 3 \times 10^{14}\,M_\odot\,h^{-1}$, 8% reside above $+2\sigma$ and 4% below $-2\sigma$. The scatter found for the higher-mass clusters is clearly not descriptive of the lower-mass clusters; this is explored further in the Appendix.

The PLp and PLc approaches rely on a single summary statistic, $\sigma_v$, to describe the dynamics of the cluster members. However, mergers and infalling matter, for example, can distort the shape of the velocity PDF and cause the cluster's mass to be overpredicted by a traditional power-law approach. Next, we will explore a machine learning approach for predicting cluster masses that learns from a distribution, rather than from a single summary statistic.

2.4. Support Distribution Machines 


Support distribution machines (SDMs; Sutherland et al. 2012) are a class of machine learning algorithms built upon Support Vector Machines (SVMs; Drucker et al. 1997; Schölkopf & Smola 2002). Given a training set of (distribution, scalar) pairs, the goal of SDM is to learn a function that predicts a scalar from a distribution. They will be applied here to learn from distributions of galaxy observables such as galaxy LOS velocity and projected distance from cluster center. These distributions of galaxy observables will then be used to predict the log of the cluster mass, $\log(M)$.

The SDM method applied here requires the divergence between pairs of distributions in the training and test sets. For this purpose, we employ the Kullback-Leibler (KL) divergence, and estimate the divergence via the estimator from Wang et al. (2009). This is a $k$-nearest-neighbor-based estimator. In practice, we use $k = 3$. The relative divergences from training data are used to select SDM best-fit kernel parameters $C$ and $\sigma$, the loss function parameter and Gaussian kernel parameter, respectively, via 3-fold cross-validation. These are used to train the regression model with the selected best-fit kernel, which in turn is used to predict masses for the test data. For a full discussion of SVM formalism as well as a discussion of how SDM deviates from the SVM base case, see Sutherland et al. (2012) and Ntampaka et al. (2015).
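
To make the divergence step concrete, the sketch below implements a brute-force version of the $k$-nearest-neighbor KL estimator in the form commonly attributed to Wang et al. (2009), $\hat{D}(P\|Q) = \frac{d}{n}\sum_i \log\frac{\nu_k(i)}{\rho_k(i)} + \log\frac{m}{n-1}$. It is not the SDM code used for this work; the function name `knn_kl_divergence` is hypothetical, and the quadratic-time neighbor search is chosen for readability rather than speed.

```python
import math


def knn_kl_divergence(x, y, k=3):
    """k-NN estimate of D(P || Q) from samples x ~ P and y ~ Q.

    x, y : lists of equal-length tuples (points in d dimensions).
    Assumes no duplicate points in x (a zero neighbor distance would
    make the logarithm undefined).
    """
    n, m, d = len(x), len(y), len(x[0])
    total = 0.0
    for i, xi in enumerate(x):
        # rho: distance from xi to its k-th nearest neighbor among the OTHER x
        rho = sorted(math.dist(xi, xj) for j, xj in enumerate(x) if j != i)[k - 1]
        # nu: distance from xi to its k-th nearest neighbor among the y points
        nu = sorted(math.dist(xi, yj) for yj in y)[k - 1]
        total += math.log(nu / rho)
    return d * total / n + math.log(m / (n - 1))
```

In the setting described here, `x` and `y` would each be the list of galaxy feature vectors (for example, one-element tuples of $|v_{\rm los}|$) for one cluster, so every pair of clusters yields one divergence value for the kernel.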

In order to take full advantage of the available data, we cyclically learn from 90% of the clusters and predict masses from the remaining, independent 10%; this is repeated ten times until the masses of all clusters in the Contaminated Catalog have been predicted. To prepare the mock cluster catalog for SDM implementation, clusters are rank-ordered by mass and sequentially assigned to one of ten folds. Multiple LOS views of a unique cluster are all assigned to the same fold, ensuring that each time SDM is implemented, a unique cluster is used either for training or for predicting, but never both.
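
The fold assignment can be illustrated with a short sketch. The names `assign_folds` and `split_views` and the dictionary and tuple layouts are assumptions made for the example.

```python
def assign_folds(cluster_masses, n_folds=10):
    """Rank unique clusters by mass and deal them sequentially into folds.

    cluster_masses : dict mapping cluster_id -> mass.
    Returns a dict mapping cluster_id -> fold index in [0, n_folds).
    """
    ranked = sorted(cluster_masses, key=cluster_masses.get)
    return {cluster_id: rank % n_folds for rank, cluster_id in enumerate(ranked)}


def split_views(views, folds, test_fold):
    """Split LOS views into train and test sets by their cluster's fold.

    views : list of (cluster_id, view_data) pairs; every view of a given
            cluster lands on the same side of the split.
    """
    train = [v for v in views if folds[v[0]] != test_fold]
    test = [v for v in views if folds[v[0]] == test_fold]
    return train, test
```

Keying the fold on the unique cluster rather than on the individual view is what prevents a rotated copy of a test cluster from leaking into the training set, and dealing the mass-ranked list sequentially gives every fold a similar mass distribution.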

Of the ten folds, nine from the Contaminated Train Catalog are used to select SDM best-fit kernel parameters $C$ and $\sigma$ and subsequently train the regression model with the selected kernel. This regression model is then used to predict the masses of the clusters in the tenth fold of the Contaminated Test Catalog. The process is repeated ten times, training on nine Train Catalog folds and predicting the tenth Test Catalog fold, until masses for the entire Contaminated Test Catalog have been predicted.

We implement SDM with four sets of galaxy features: the PDF of galaxy LOS absolute velocity ($|v_{\rm los}|$), the PDF of normalized velocity ($|v_{\rm los}|/\sigma_v$), the PDF of projected distance from the cluster center ($R$), and combinations thereof. As discussed in Ntampaka et al. (2015), features must be chosen with care because features uncorrelated with mass tend to wash out the effects of the more important features. The motivation for features





































Ntampaka et al. 


TABLE 2
Feature Summary

Case      Approach               Train & Test Catalogs  Summary Stats  Distribution Features                               Color
PLp       Power Law              Pure                   $\sigma_v$     -                                                   Red
PLc       Power Law              Contaminated           $\sigma_v$     -                                                   Blue
MLv       Machine Learning: SDM  Contaminated           -              $|v_{\rm los}|$                                     Green
MLr       Machine Learning: SDM  Contaminated           -              $R$                                                 Orange
MLv,r     Machine Learning: SDM  Contaminated           -              $|v_{\rm los}|$ & $R$                               Brown
MLv,σ,r   Machine Learning: SDM  Contaminated           -              $|v_{\rm los}|$, $|v_{\rm los}|/\sigma_v$, & $R$    Purple



Fig. 3.— Top: Average number of galaxies per unit plane-of-sky area, $dN/dA$, vs. projected distance from the center of the cluster, $R$, for three $\log[M\,(M_\odot\,h^{-1})]$ ranges in the Contaminated Test Catalog, in $0.1\,{\rm Mpc}\,h^{-1}$ bins. The shape and amplitude of this effective column density vary with the mass of the primary halo. Bottom: Probability of finding a galaxy per unit area, $dp/dA$, vs. $R$. The shape and amplitude of this measure also vary with primary halo mass. Arrows denote the characteristic radius of a halo with the $\log[M\,(M_\odot\,h^{-1})]$ indicated. SDM trained on the feature $R$ takes advantage of how the distribution of subhalo radius changes with mass to predict a halo mass based on the distribution of $R$.

implemented here is as follows:

1. MLv: The use of velocities is motivated by the virial theorem, as we have seen in Figure 2 that the velocity dispersion of galaxies, $\sigma_v$, relates to mass as a power law, albeit with significant scatter. The MLv catalog uses the absolute value of galaxy LOS velocities, $|v_{\rm los}|$, as a single feature for training and testing by means of SDM.

2. MLr: Even in the presence of interlopers, galaxy density profiles can be used to determine cluster mass (e.g. Hansen et al. 2005; Pearson et al. 2015). This is motivated by Figure 3, which shows stacked halos from the Contaminated Test Catalog divided into three $\log[M\,(M_\odot\,h^{-1})]$ bins. Despite the fixed aperture, the number of galaxies per unit plane-of-sky area ($dN/dA$) in concentric rings has a markedly different distribution for the low-, middle-, and high-mass halos. The probability of finding a galaxy per unit plane-of-sky area ($dp/dA$) also exhibits a unique shape for each mass bin. For this reason, we will consider an MLr catalog, with the galaxy radii from the halo center, $R$, as the sole feature.

3. MLv,r: Decreasing velocity dispersion profiles have been noted in clusters (e.g. Rines et al. 2003). Because $v_{\rm los}$ and $R$ individually can provide information about cluster mass, it seems reasonable that the joint probability distribution of $|v_{\rm los}|$ and $R$ may be informative as well. MLv,r will learn from the joint distribution of the LOS velocity feature, $|v_{\rm los}|$, and the galaxy radius feature, $R$, in a two-dimensional feature space.

4. MLv,σ,r: The shape of the velocity PDF can be indicative of mass accretion and mergers (Evrard et al. 2008; Ribeiro et al. 2011). As found in Ntampaka et al. (2015), explicitly normalizing $v_{\rm los}$ by its width, $\sigma_v$, can emphasize these shape differences and improve mass predictions, particularly at the high-mass end. We will consider a training set, MLv,σ,r, that employs $|v_{\rm los}|$, $|v_{\rm los}|/\sigma_v$, and $R$ in a three-dimensional feature space.

These ML method names and corresponding distribution features are summarized in Table 2 for reference and will be used by SDM to predict cluster masses. Next, we will explore how the PL scaling relation and the ML distribution-to-scalar approach predict the masses of clusters from the mock cluster catalog.

3. RESULTS
3.1. Power Law

Figure 4 shows the predicted vs. true cluster masses for the Pure and Contaminated Catalogs. When a power law is applied to the Pure Catalog, there is significant scatter in mass predictions. The bottom panel of Figure 4 shows the median and 68% scatter in the fractional mass error, $\epsilon$, given by

$$\epsilon = (M_{\rm pred} - M)/M, \qquad (3)$$

where $M$ is the true cluster mass and $M_{\rm pred}$ is the predicted cluster mass. The scatter in PLp errors can be attributed to both physical and selection effects. For example, infalling matter tends to create a velocity PDF with negative kurtosis, which tends to cause the mass to be overpredicted. Cluster mergers (Evrard et al. 2008), galaxy selection effects (Saro et al. 2013), and dynamical friction and tidal disruption (Munari et al. 2013) can each play a role in

Fig. 4.— Left: Power-law scaling relation applied to the Pure Catalog (method PLp). Predicted vs. true mass, binned in 0.1 dex $\log[M\,(M_\odot\,h^{-1})]$ bins, with mean (black solid), median (red solid), 68% (dashed), and 95% (dotted) scatter shows that significant scatter exists even when applying a scaling relation to a catalog of pure and complete clusters (top). Though the mass error median (red solid) is nearly zero (gray solid), it has significant 68% scatter (red dashed) (bottom). Right: Power-law scaling relation applied to the Contaminated Catalog, which contains impure and incomplete clusters (method PLc). The imperfect catalog introduces additional scatter in $\epsilon$ compared to the PLp case, most notably at low masses where the sample impurity is particularly pronounced. These two plots provide best (left) and worst (right) case scenario benchmarks for applying an $M(\sigma_v)$ power-law scaling relation to cluster observations.

Figure 4 also shows results for the power-law scaling relation applied to the Contaminated Catalog. Impure and incomplete clusters introduce further scatter and errors increase significantly. This scatter is most notable at the low-mass end, where the inclusion of interlopers is most prominent.

PLp and PLc serve as lower and upper bounds, respectively, on the errors of a power-law scaling relation: PLp's pure and complete clusters show the level of scatter that remains when interlopers are completely eliminated, while PLc's simplistic interloper removal technique highlights how interlopers can affect scatter in an extreme case. More effective interloper removal methods are available, applying more discriminating statistical techniques (e.g. Fadda et al. 1996; von der Linden et al. 2007; Mamon et al. 2013), with some considering only red elliptical galaxies, which preferentially reside in clusters (e.g. Saro et al. 2013). We expect a more refined interloper-removal scheme to reside between the two benchmark cases shown in Figure 4.

One may consider the possibility of improving mass
predictions by extending the mass range used for training.
However, due to the existence of the many high-error,
high-$\sigma_v$ clusters seen earlier, decreasing the lower mass
limit may not improve mass predictions. Even without
this high-error population, the power-law dynamical
mass approach has significant scatter that is exacerbated by the
presence of interlopers. Further, the potentially informative
infalling galaxy observations have not been considered,
nor have the baseline LOS velocity PDF shapes
indicative of a nonvirialized or merging system. Next, we
will explore the results of learning on full distributions
with a machine learning approach.

3.2. Machine Learning 

Figure 5 shows the SDM predictions for each of the
four feature sets: MLv, MLr, MLv,r, and MLv,σ,r. As
in Figure 4, the top panel shows the predicted vs. true mass
median with 68% and 95% scatter. Each of the ML methods
reduces scatter significantly compared to PLc, the
power law that is applied to the same catalog as these
ML methods. One should not overinterpret the fluctuations
in the two largest mass bins, as they contain only
six unique clusters, a small fraction of the total clusters
in the sample. The bottom panel shows the median error $\epsilon$
(defined above) with 68% scatter. The 68% scatter is
dramatically reduced compared to the power-law relation
with the same catalog, PLc, and is comparable to
the power-law relation with a catalog of pure and complete
clusters, PLp. MLv,σ,r has median binned mass
predictions that are closest to the true mass, while MLr
has the smallest error width, but all four ML methods
outperform PLc by a large margin.

Fig. 5.— Top Left: SDM results for MLv (green). The predicted vs. true mass is binned in 0.1 dex $\log[M\,(M_\odot\,h^{-1})]$ bins.
Mean (black solid), median (colored solid), 68% (dashed), and 95% (dotted) scatter are shown (top). The median error (solid)
and error 68% scatter (dashed) are also shown (bottom). MLv gives better than a factor-of-two reduction in the width of error
compared to a standard scaling relation applied to the same catalog. Top Right: SDM results for MLr (orange). MLr and
MLv minimize the width of the error distribution. Bottom Left: SDM results for MLv,r (brown). MLv,r underpredicts at high
masses and is therefore identified as a disfavored method. Bottom Right: SDM results for MLv,σ,r (purple). MLv,σ,r minimizes
the tendency to underpredict across the mass range.

Fig. 6.— Top: Error 16th and 84th percentiles (i.e. 68% scatter) as a function of mass for MLv (green) as compared to a power-law
approach applied to the Pure Catalog (PLp, red) and to the Contaminated Catalog (PLc, blue). Bottom: Error scatter as a function of
mass for MLv,σ,r (purple) compared to PLp and PLc. The errors of a dynamical mass power-law approach with a more refined interloper
removal scheme should be bounded by PLc and PLp. However, even when trained on the impure and incomplete catalog that produced
the blue PLc results, MLv and MLv,σ,r have an $\epsilon$ width comparable to or smaller than the best-case PLp power law.

A comparison of mass predictions is presented in Figure 6.
PL provides two benchmarks: while the PLc
error shows what we might expect from an impure and
incomplete catalog contaminated by interlopers, PLp gives
a best-case scenario where cluster members are perfectly
known and interlopers are entirely excluded. Across the
entire mass range considered, MLv and MLv,σ,r exhibit a
dramatically tighter error distribution than a power law applied
to the Contaminated Catalog. Even in comparison to the power
law applied to the Pure Catalog, SDM produces a tighter
error distribution.

Figure 7 shows a PDF of errors for all clusters above
$3\times10^{14}\,M_\odot\,h^{-1}$ and for those above $7\times10^{14}\,M_\odot\,h^{-1}$.
The PLc curve shows the PDF of errors associated with the
$M(\sigma_v)$ power law with the Contaminated Catalog’s simple
cylindrical cut about cluster centers. In contrast, the
PLp curve shows the PDF of errors associated with the
$M(\sigma_v)$ power law of the Pure Catalog, built from perfect
knowledge of cluster members. For both MLv and
MLv,σ,r, the number of extreme overpredicted masses
with $\epsilon > 0.6$ is dramatically reduced relative to even the PLp
power law. The extreme underpredicted masses with
$\epsilon < -0.6$ are reduced compared to PLc.

Fig. 7.— Left: PDF of fractional mass errors for the Full Test Catalogs. A power-law $M(\sigma_v)$ scaling relation for a catalog of pure
and complete clusters shows significant errors (PLp, red solid). The error distribution widens further when interlopers contaminate the
clusters (PLc, blue dashed). Remarkably, SDM (MLv, green dotted, and MLv,σ,r, purple dash dotted) applied to the Contaminated
Catalog outperforms the $M(\sigma_v)$ scaling relation applied to the Pure Catalog. Center: PDF of errors for the High-Mass Test Catalogs
($M > 7\times10^{14}\,M_\odot\,h^{-1}$) shows a similar trend for rare, high-mass halos; the ML approaches reduce error significantly compared to a power-law
scaling relation applied to the same catalog. Right: PDF of the high-$\delta$, high-PLc-error population of clusters. While the power law
catastrophically overestimates the masses of these outlying objects, ML approaches perform well, with a PDF of fractional mass errors for
these outliers that is only slightly wider than is found for the full catalog.

The mean error ($\bar{\epsilon}$) and median with central 68% width
($\tilde{\epsilon} \pm \Delta\epsilon$) of these PDFs are summarized in Table 3. Here
we see PL’s tendency to overpredict (positive $\bar{\epsilon}$ and $\tilde{\epsilon}$)
in contrast with ML’s tendency to underpredict (negative
$\bar{\epsilon}$ and $\tilde{\epsilon}$). ML’s underpredictions are caused by
the hard upper mass limit and the dearth of unique training
halos at the high-mass end. The resulting underprediction
is most conspicuous in MLv (both the Contaminated
Test and Contaminated High-Mass Test) and
in MLv,r (Contaminated High-Mass Test only). MLv,r
has the smallest error offset (-0.04), but does so at the
cost of underpredicting the highest-mass clusters. This
bias is most evident at the higher mass end, where halos’
masses are systematically underpredicted. Because
of this pronounced bias, MLv,r is identified as
a disfavored method.
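
The summary statistics quoted in this section (mean error, median error, and the width of the central 68% of the error distribution) can be sketched in a few lines. This is a minimal illustration rather than the analysis code used in this work: it assumes the fractional error is $\epsilon = (M_{\rm pred} - M)/M$, takes the 68% width as the distance between the 16th and 84th percentiles, and the function names are hypothetical.

```python
import statistics


def fractional_errors(m_true, m_pred):
    """Fractional mass error per cluster, assuming eps = (M_pred - M) / M."""
    return [(p - t) / t for t, p in zip(m_true, m_pred)]


def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in [0, 100]) of a pre-sorted list."""
    if not sorted_vals:
        raise ValueError("need at least one value")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def summarize(eps):
    """Mean, median, and central-68% width of a list of fractional errors."""
    s = sorted(eps)
    p16, p50, p84 = (percentile(s, q) for q in (16, 50, 84))
    return {
        "mean": statistics.fmean(eps),
        "median": p50,
        "plus": p84 - p50,    # upper half of the 68% scatter
        "minus": p50 - p16,   # lower half of the 68% scatter
        "width": p84 - p16,   # the width quoted as Delta-epsilon
    }
```

Calling `summarize(fractional_errors(m_true, m_pred))` on a test catalog gives the quantities that a row of the method comparison table reports, under the percentile convention assumed here.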

The error widths ($\Delta\epsilon$) for all ML methods
are more than a factor of two smaller
than that of PLc (reductions of 69%, 69%, 58%, and 64% for MLv, MLr,
MLv,r, and MLv,σ,r, respectively). Even compared to
PLp, which is applied to the Pure Catalog, SDM produces
a comparable or smaller error width (changes of 23%, 23%, -3%, and
12% for MLv, MLr, MLv,r, and MLv,σ,r, respectively, where
the negative value indicates that the MLv,r width is slightly larger than that of PLp).
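
The percentages in this paragraph follow directly from the error widths listed in Table 3. The short check below reproduces them from the Test-catalog widths as transcribed from that table; the dictionary keys and the helper name are illustrative only. With these values the MLv,r width (0.899) is slightly larger than the PLp width (0.871), so its entry relative to PLp corresponds to a 3% increase rather than a reduction.

```python
# Central-68% error widths for the Test catalogs, as listed in Table 3.
WIDTHS = {
    "PLp": 0.871,
    "PLc": 2.131,
    "MLv": 0.670,
    "MLr": 0.670,
    "MLv,r": 0.899,
    "MLv,sigma,r": 0.763,
}

ML_METHODS = ("MLv", "MLr", "MLv,r", "MLv,sigma,r")


def percent_reduction(reference, value):
    """Percent by which `value` is smaller than `reference` (negative if larger)."""
    return 100.0 * (reference - value) / reference


vs_plc = {m: round(percent_reduction(WIDTHS["PLc"], WIDTHS[m])) for m in ML_METHODS}
vs_plp = {m: round(percent_reduction(WIDTHS["PLp"], WIDTHS[m])) for m in ML_METHODS}

# Ratio of the PLc width to each ML width ("factor of two" check).
factor_vs_plc = {m: WIDTHS["PLc"] / WIDTHS[m] for m in ML_METHODS}
```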

As we saw in earlier figures, there is a wide scatter
in $\sigma_v$ associated with the Contaminated Test Catalog.
Shown in the right panel of Figure 7 are the clusters for
which PLc severely overestimated cluster mass. These
objects are particularly worrisome, as they are predicted by
PLc to be much more massive than they truly are,
appearing to be rare, high-mass clusters. These outliers
are isolated by their residual, $\delta$ (defined earlier); each has
$\delta > 1.5 \times \sigma_{\rm gauss}$. We find the ML error PDF for these
objects is centered on zero, with a width only slightly
wider than the one shown in the left panel of Figure 7
for the full catalog. Further, while the PLc method
overpredicts catastrophically, the ML methods predict much
more reasonable masses.

Figure 8 shows a comparison of the five methods applied
to the Contaminated Catalog: PLc, MLv, MLr,
MLv,r, and MLv,σ,r. The difference in absolute errors,
denoted $|\epsilon_{\rm row}| - |\epsilon_{\rm column}|$, gives a measure of how well the
row method predicts compared to the column method;
values below 0 indicate that the row method predicts
more accurately. The left column of this plot shows
a comparison of ML to PL; all four ML methods
consistently predict masses with a much smaller error than
PLc. The mean difference in absolute value of errors,
denoted $|\epsilon| - |\epsilon_{\rm PLc}|$, is summarized in Table 3. This
summary statistic quantifies the mean value shown in the left
column of Figure 8. The more negative this value, the
more a model reduces errors compared to PLc. Model
MLr decreases the absolute error $|\epsilon|$ by an average of 0.64
compared to PLc; MLr is the best ML method by this measure. The
right three columns of Figure 8 compare the ML techniques
to one another. MLv,r is shown here to be the
weakest of the ML methods; though it outperforms PLc
by a large margin, SDM produces more accurate mass
predictions when applied with other feature sets.
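
The comparison statistic described here, the difference in absolute errors between a row method and a column method, can be sketched as follows. This is an illustrative outline, not the code behind the figure: the function names are hypothetical, and the 0.1 dex binning in log mass is assumed to match the binning used in the earlier figures.

```python
import math
from collections import defaultdict


def abs_error_difference(eps_row, eps_col):
    """Per-cluster |eps_row| - |eps_col|; negative means the row method is closer."""
    return [abs(r) - abs(c) for r, c in zip(eps_row, eps_col)]


def mean(values):
    return sum(values) / len(values)


def binned_mean_difference(log_mass, diffs, bin_width=0.1):
    """Mean difference per log-mass bin.

    Keys are integer bin indices k, where bin k covers
    [k * bin_width, (k + 1) * bin_width) in log10(M).
    """
    bins = defaultdict(list)
    for lm, d in zip(log_mass, diffs):
        bins[math.floor(lm / bin_width)].append(d)
    return {k: mean(v) for k, v in sorted(bins.items())}
```

Averaging `abs_error_difference(eps_ml, eps_plc)` over all test clusters gives the single number reported per method in the last column of the method comparison table, while the binned version corresponds to one panel of the row-versus-column figure.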


As in Ntampaka et al. (2015), pairing $|v_{\rm los}|$ with the
feature $|v_{\rm los}|/\sigma_v$ accentuates differences in velocity PDF
shape and highlights, for example, the wide, flat hallmark
PDF of a halo experiencing infalling matter. As a result
of this additional feature, the mean and median errors
edge closer to the desired value of zero. This offers an
explanation as to why the three-feature set of MLv,σ,r
shows a mean error closer to zero (0.01) compared to
MLv and MLr. MLv,σ,r is identified as the preferred
feature set for minimizing error bias.

Though MLv,r employs two features that are highly
correlated with mass, these features reside in a
two-dimensional feature space. The joint distribution of $|v_{\rm los}|$
and $R$ is likely too sparsely sampled by the galaxies in an
individual cluster to establish a strong correlation between
this joint distribution and cluster mass. This effect
becomes particularly pronounced for rare, massive clusters,
which are underpredicted by MLv,r.

MLv,σ,r, however, predicts the masses of these clusters
well. This may be explained by the nature of the
third feature, $|v_{\rm los}|/\sigma_v$. Though the probability distribution
employed by MLv,σ,r resides in a three-dimensional
feature space, the combination of $|v_{\rm los}|$ with $|v_{\rm los}|/\sigma_v$
constrains individual clusters’ distributions to lie on a plane.
These planes are sorted in the three-dimensional space by
their slope, $\sigma_v$. This sorting effectively isolates high-$\sigma_v$
clusters from low-$\sigma_v$ ones. As we have seen with PLc,
$\sigma_v$ is a predictor of mass, albeit with significant scatter.
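
The geometric argument above can be checked with a small sketch. For one cluster, every galaxy's feature vector $(|v_{\rm los}|, |v_{\rm los}|/\sigma_v, R)$ has a first-to-second component ratio equal to that cluster's $\sigma_v$, so all of its galaxies fall on a single plane whose slope is $\sigma_v$. The code below is illustrative only: it assumes $\sigma_v$ is the sample standard deviation of the LOS velocities, which may differ from the dispersion estimator used in this work, and the function name is hypothetical.

```python
import statistics


def cluster_features(v_los, r_proj):
    """Build the three-feature set for one cluster.

    v_los  : LOS velocities of member galaxies (km/s), relative to the cluster
    r_proj : projected distances from the cluster center
    Returns (sigma_v, list of (|v|, |v| / sigma_v, R) tuples).
    sigma_v is taken as the sample standard deviation (an assumption).
    """
    sigma_v = statistics.stdev(v_los)
    features = [(abs(v), abs(v) / sigma_v, r) for v, r in zip(v_los, r_proj)]
    return sigma_v, features


# Toy cluster with five galaxies (made-up numbers, for illustration only).
sigma_v, feats = cluster_features(
    [-300.0, 100.0, 500.0, -700.0, 400.0],
    [0.2, 0.5, 0.9, 1.2, 0.4],
)

# Every galaxy satisfies |v| = sigma_v * (|v| / sigma_v): one plane per cluster.
slopes = [f[0] / f[1] for f in feats]
```

Two clusters with different dispersions therefore occupy different planes in the feature space, which is the sorting by $\sigma_v$ described in the text.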

By taking advantage of the full LOS velocity and projected
radius distributions, the SDM approach to determining
cluster mass from galaxy observables reduces the
width of the error distribution by roughly a factor of two, and also
predicts masses well even in the cases where PLc
catastrophically overpredicts, making it a valuable tool for
probing cosmological models with observations of galaxy
clusters.

Fig. 8.— Summary comparison of the five methods trained and tested on the Contaminated Catalog, with difference in absolute error,
$|\epsilon_{\rm row}| - |\epsilon_{\rm column}|$, as a function of mass. Values below the solid black 0 line indicate that the row method is performing
better than the column method for a given mass bin. The left column summarizes a comparison of the four new SDM methods to the PLc
power law; SDM with any of the four feature combinations improves mass predictions in all mass bins. While MLv,r outperforms PLc, it
performs poorly at high masses compared to the other ML methods.

4. DISCUSSION 

Reducing errors and eliminating biases in cluster mass
measurements are crucial to utilizing clusters to discern
and constrain cosmological models. The halo mass function
and its evolution are sensitive to cosmological
parameters such as $\sigma_8$, $\Omega_{\rm DE}$, and $w$ (e.g. Schuecker
et al. 2003; Henry et al. 2009; Vikhlinin et al. 2009; Rozo
et al. 2010; Mantz et al. 2010; Allen et al. 2011).
Therefore, accurate measurements of cluster abundance as a
function of mass and redshift can be used to understand
the underlying cosmology. The limiting factor in
constraining parameters and evaluating cosmological models
with cluster counts, however, is in accurately connecting
galaxy observables to halo mass to reproduce the halo
mass function.

Figure 9 shows how the scatter and biases in each
model affect the halo mass functions recovered by PLp,
PLc, MLv, and MLv,p (SDM applied to the pure catalog
with feature $|v_{\rm los}|$, as in Ntampaka et al. (2015)) in
comparison to the simulation’s true mass function. The
scatter about the scaling relation in PLp, coupled with
the rapidly declining shape of the mass function, causes
the abundant, low-mass clusters with high $\delta$ to populate
the high-mass bins in the mass function, causing the
upscattering at high masses. This effect is exacerbated
in PLc, where the scatter about the scaling relation is
much larger and the high-$\delta$ clusters may be catastrophically
overpredicted (as shown in Figure 7). This effect,
known as Eddington bias (Eddington 1913), alters the
shape and amplitude of the measured halo mass function
from the true value. This results in PLc’s measured
mass function dramatically overreporting the number of
high-mass clusters.
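
The upscattering mechanism can be illustrated with a toy model. The sketch below is not the mock-catalog analysis of this work: it assumes a pure power-law cumulative mass function, $N(>M) \propto M^{-k}$, and an unbiased lognormal scatter of width `sigma_ln` in the predicted mass, and every parameter value is hypothetical. Under those assumptions the number of clusters above a high-mass threshold is boosted by a factor $\exp(k^2\sigma^2/2)$, which the Monte Carlo draw reproduces approximately.

```python
import math
import random


def upscatter_boost(k, sigma_ln):
    """Analytic boost in N(>M) for N(>M) ~ M^-k with lognormal scatter sigma_ln."""
    return math.exp(0.5 * (k * sigma_ln) ** 2)


def count_above(masses, threshold):
    return sum(1 for m in masses if m > threshold)


def simulate(n_clusters=20000, k=2.0, sigma_ln=0.5, threshold=10.0, seed=1):
    """Toy Eddington-bias experiment; masses are in units of the minimum mass."""
    rng = random.Random(seed)
    # Pareto draw: P(M > m) = m^-k for m >= 1, a steeply falling mass function.
    true_m = [rng.paretovariate(k) for _ in range(n_clusters)]
    # Unbiased-in-log lognormal scatter applied to every cluster.
    obs_m = [m * rng.lognormvariate(0.0, sigma_ln) for m in true_m]
    return count_above(true_m, threshold), count_above(obs_m, threshold)


n_true, n_obs = simulate()
print("true count above threshold:     ", n_true)
print("scattered count above threshold:", n_obs)
print("analytic boost factor:          ", round(upscatter_boost(2.0, 0.5), 3))
```

Because low-mass objects vastly outnumber high-mass ones, more clusters scatter up across the threshold than scatter down, even though the scatter itself is symmetric in log mass. A wider scatter (as for PLc) gives a larger boost.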

Any cosmological analysis of the HMF that employs
such mass measurements must correct for this upscatter
at high masses. Understanding the nature of the
intrinsic scatter and observational selection effects is a
crucial step to correct the observed HMF for Eddington
bias. Analytic approaches exist to correct for the simple
case of lognormal scatter (e.g. Mortonson et al. 2011;
Evrard et al. 2014), while a more complicated scatter
may be more difficult to correct. Before correction for
Eddington bias, the large scatter and errors associated
with traditional power-law mass measurements lead to
the failure to recover the true mass function, which limits
the constraining power of dynamical mass measurements
of galaxy clusters. PLp’s altered shape mimics
the mass function of a simulation with a higher $\sigma_8$ and
$\Omega_m$. This is particularly pronounced in the fractional
difference, $\Delta y/y$, between the Multidark and mock HMFs,
which shows that the presence of interlopers causes the
PL HMF to deviate from the simulation HMF, particularly
at high masses.

TABLE 3
Method Comparison

Case      Summary                                  Color   Catalog         Mean [1]  Median ± 68% [2]  Δε [3]  Mean |ε| - |ε_PLc| [4]
PLp       M(σ_v) Power Law, Pure                   Red     Test             0.128    [illegible]       0.871   —
                                                           High-Mass Test   0.093    [illegible]       0.731   —
PLc       M(σ_v) Power Law, Contaminated           Blue    Test             0.508    [illegible]       2.131   —
                                                           High-Mass Test   0.409    [illegible]       1.829   —
MLv       ML with |v_los|                          Green   Test            -0.052    [illegible]       0.670   -0.63
                                                           High-Mass Test  -0.059    [illegible]       0.686   -0.47
MLr       ML with R                                Orange  Test            -0.016    [illegible]       0.670   -0.64
                                                           High-Mass Test  -0.040    [illegible]       0.635   -0.49
MLv,r     ML with |v_los|, R                       Brown   Test             0.078    [illegible]       0.899   -0.54
                                                           High-Mass Test  -0.032    [illegible]       0.783   -0.42
MLv,σ,r   ML with |v_los|, |v_los|/σ_v, & R        Purple  Test             0.011    [illegible]       0.763   -0.61
                                                           High-Mass Test  -0.044    [illegible]       0.649   -0.49

[1] Mean fractional mass error.
[2] Median fractional mass error ± 68% scatter.
[3] Width of ε 68% scatter.
[4] Mean difference between model and PLc errors.

At the low mass end, the underabundance of clusters 
is not caused by Eddington bias, but is an artifact of 
the hard lower mass limits of the test catalogs. This 
downscattering should not be interpreted as a dearth of 
low-mass clusters predicted by the PL and ML methods, 
but rather as a limitation of the test catalogs. 

In addition to the halo mass functions from the methods
highlighted in this work, mock HMFs that include the
scatter of other common cluster mass measurement techniques
are included for comparison. Cluster masses can
be deduced from a variety of techniques, and here we
show three different methods for determining cluster
mass: the Sunyaev-Zel’dovich (SZ) effect, weak gravitational
lensing (WL), and x-ray. The SZ effect, first
proposed by Sunyaev & Zeldovich (1972), can be used
to determine a temperature-weighted gas mass, and we
model its intrinsic scatter according to the Battaglia et al.
(2012) scaling relation for $z = 0$ with AGN feedback.
Weak gravitational lensing probes structure along the
line of sight, and we model scatter in this technique
according to the Becker & Kravtsov (2011) prescription
for $z = 0.25$, $M_{500c} > 2.0\times10^{14}\,M_\odot\,h^{-1}$ clusters.
X-ray observations can be used to infer a gas mass profile;
scatter in this $M$-$Y_X$ relation of $\sigma_{\ln M} = 0.06$ is
adopted from Fabjan et al. (2011), and it should be noted
that this is intrinsic scatter and does not include
observational effects. The mass-concentration relation from
Bhattacharya et al. (2013) and the NFW density profile
from Navarro et al. are implemented to convert
all masses to a common mass definition for comparison.

Figure 9 shows the halo mass functions recovered by
SZ, WL, and x-ray methods compared to the range of
scatters achievable with SDM: MLv,p with a pure and
complete cluster membership catalog and MLv with a
large cylindrical cut around each cluster allowing many
interlopers. It should be noted that the HMF presented
assumes a complete, large mock observation of
6834 (7449) clusters in the Pure (Contaminated) Catalog.
Figure 9 also shows the Poisson error associated
with a more reasonable observation of 500 clusters. Current
cluster surveys (e.g. de Haan et al.) contain on
the order of hundreds of clusters, and the sample of 500
clusters is chosen to show the errors accessible through
current catalogs. Note that the small number of high-mass
objects limits the accuracy with which the tail end
of the HMF can be determined. As is shown in, e.g.,
Ntampaka et al. (2016), a binned HMF has the most
power to resolve $\sigma_8$-$\Omega_m$ models at the lowest masses
because, while high-mass clusters are sensitive to changes
in these cosmological parameters, the Poisson error bars
on these rare objects dominate. For the mass ranges
where the HMF can best resolve changes in $\sigma_8$ and $\Omega_m$,
SDM produces an HMF that is competitive with these other mass
proxies, though it has a larger deviation from the true
HMF at the high-mass tail.
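
The role of Poisson errors in a smaller survey can be sketched numerically. The example below is a rough illustration, not the procedure used to draw the error band in the figure: it assumes each mass bin's expected count scales with the survey size and that the fractional Poisson error on a bin with expected count $N$ is $1/\sqrt{N}$. The bin counts are hypothetical and chosen only so that they sum to the 7449 clusters of the Contaminated Catalog.

```python
import math


def rescale_counts(bin_counts, n_total, n_survey):
    """Expected per-bin counts when only n_survey of n_total clusters are observed."""
    return [c * n_survey / n_total for c in bin_counts]


def poisson_fractional_error(expected_count):
    """Fractional Poisson uncertainty, 1 / sqrt(N), on a bin with expected count N."""
    if expected_count <= 0:
        raise ValueError("expected count must be positive")
    return 1.0 / math.sqrt(expected_count)


# Hypothetical counts per mass bin, from low mass to high mass (sum = 7449).
full_counts = [4000, 2000, 1000, 400, 43, 6]
survey_counts = rescale_counts(full_counts, n_total=7449, n_survey=500)
frac_errors = [poisson_fractional_error(c) for c in survey_counts]

for n, err in zip(survey_counts, frac_errors):
    print(f"expected clusters in bin: {n:7.2f}   fractional Poisson error: {err:5.2f}")
```

In this toy setup the lowest-mass bin is measured to a few percent, while the highest-mass bin expects fewer than one cluster, which is why the high-mass tail contributes little constraining power in a survey of this size.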

Fig. 9.— Halo mass functions of dynamical cluster mass estimates with intrinsic scatter only (Pure Catalog) and intrinsic scatter plus
observational selection effects (Contaminated Catalog). Any scatter in the mass-observable relationship, if uncorrected, will affect the
observed halo mass function. The large scatter associated with the power-law scaling relation (PLp, red squares, and PLc, blue circles)
causes an upscatter at high masses, while ML methods (MLv,p, purple stars, and MLv, green triangles) have a smaller intrinsic scatter
and more accurately reproduce the true Multidark cluster abundance (dark gray solid curve). While 6834 (7449) clusters contribute to the
HMF for the Pure (Contaminated) Catalog, a more moderate observation of 500 clusters yields larger Poisson error bars (light gray band).
Right: HMF of ML methods compared to mock HMF with the typical intrinsic scatter of Sunyaev-Zel’dovich (pink diamond), weak lensing
(brown x), and x-ray (orange octagon) cluster masses. The biases and the observational effects associated with SZ, WL, and x-ray masses
may introduce additional scatter, causing the HMF to deviate further from the simulation HMF.

However, it should be noted that these cluster mass
methods utilize different wavelength observations with
different systematic errors, biases, and limitations.
Therefore, while Figure 9 shows five different cluster
mass techniques (PL, ML, SZ, x-ray, and WL) in
a direct comparison, it should not be overinterpreted
as a definitive guide to cluster mass measurement. For
example, weak lensing is difficult and expensive to apply
to high-redshift clusters due to a lack of adequate
background galaxies. Biases in x-ray and SZ cluster masses
may arise because of nonthermal pressure support (e.g.
Evrard 1990; Rasia et al. 2004; Lau et al. 2009); this
bias is not modeled in Figure 9 because this effect is
typically corrected for, though uncertainty in the bias may
produce further disagreement between observed and true
HMF. When SZ masses are calibrated on simulation, the
calibration is dependent on correct modeling of the gas
physics (e.g. Nagai 2006; Battaglia et al. 2012), which
may also introduce a bias.

Dynamical and ML masses, however, can be directly
compared, as they are produced from the same data from
the same mock catalog and are affected by the same
observational selection effects. From their direct comparison,
it can be concluded that the ML method presented
in this work is more competitive than a power-law scaling
relation for decreasing errors in cluster mass measurements.
While MLv overpredicts the abundance of
high-mass clusters, the upscatter is smaller than PLp’s.
MLv provides a much better match to the simulation’s
true mass function across a larger mass range, comparable
to those of SZ, WL, and x-ray for the large mock
observation of $\approx (1\,{\rm Gpc}\,h^{-1})^3$. This agreement with the
true HMF is primarily due to the small spread in errors
associated with these methods; abundant, low-mass
clusters tend not to be catastrophically overpredicted by
methods with small intrinsic scatter. The smaller errors
produced in SDM’s mass predictions result in a more
accurate representation of the halo mass function,
particularly at the high-mass end. SDM’s ability to more
accurately recreate the true halo mass function makes it
a valuable tool for producing cluster mass functions to
evaluate cosmological models. The predictive power of
SDM to reproduce the true halo mass function and its
implications for constraining cosmological parameters $\sigma_8$
and $\Omega_m$ will be explored in detail in an upcoming work.
The Appendix explores how the aperture and, less directly,
the purity and completeness of the cluster sample affect
the scatter in both power-law and machine learning
dynamical masses. We find that the power-law fit changes
as a function of aperture, becoming shallower with smaller
aperture. When a large aperture is used, the distribution
of errors at low masses is not lognormal, but is better
described by a double Gaussian (see Figure 11).

With the simple cylindrical cut and $2\sigma$ paring used in
this work, mock cluster observations performed with a
large aperture will tend to be more complete (compared
to a mock observation made with a smaller aperture),
with cluster members near the edges of the cluster
being included in the sample. Mock observations with a
smaller aperture will tend to be more pure, with fewer
interlopers contaminating the observation. As shown in
the Appendix, SDM performs slightly better with
a large aperture, showing a preference for completeness
over purity.

One may consider improving SDM mass predictions
further by training and testing on features beyond simply
$R$ and $v_{\rm los}$, applying a more accurate cluster interloper
removal technique, or limiting the training sample to a
particular subpopulation of galaxies. Because elliptical
galaxies preferentially reside in galaxy clusters (Dressler
1980), limiting the training sample to this population
may provide a straightforward and natural approach to
excluding many interlopers while still providing limited
information about infalling matter. But before such a
training set can be explored and applied to observational
data, there remains a need for a reliable training N-body
simulation that is large, high resolution, and realistically
populated with galaxies.

5. CONCLUSIONS 


We compare cluster mass predictions from a standard
$M(\sigma_v)$ power-law scaling relation to those generated by
support distribution machines (SDMs), a class of machine
learning algorithms that learn from a distribution of
data to predict a scalar.

We focus on mass predictions for a mock catalog
of impure and incomplete clusters. This catalog is
created from the publicly available Multidark MDPL1
simulation, with an intentionally simplistic cylindrical
cut imposed around the known centers of clusters
with true mass $> 1\times10^{14}\,M_\odot\,h^{-1}$. The aperture
($R_{\rm aperture} = 1.6\,{\rm Mpc}\,h^{-1}$) and initial velocity cut ($v_{\rm cut} =
2500\,{\rm km\,s^{-1}}$) correspond to a typical radius and $2\times\sigma_v$
of a halo with mass $1\times10^{15}\,M_\odot\,h^{-1}$. Velocity outliers
beyond $2\sigma_v$ are iteratively pared until convergence, and
only clusters with at least 20 cluster members are kept
in the sample. This creates a catalog of clusters that
are both impure (interlopers contaminate the clusters)
and incomplete (some true cluster members are
excluded from the sample). A second catalog, both pure
and complete, is also prepared for comparison.
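
The catalog construction summarized above (a cylindrical cut followed by iterative $2\sigma$ paring and a minimum-membership requirement) can be sketched as follows. This is a simplified outline under stated assumptions, not the pipeline used to build the mock catalogs: velocities are assumed to be measured relative to the known cluster center, the dispersion is taken as the sample standard deviation about the current mean, and the function names are hypothetical. The default aperture, velocity cut, and membership threshold are the values quoted in the text.

```python
import statistics

MIN_MEMBERS = 20  # clusters with fewer surviving galaxies are dropped


def cylinder_cut(galaxies, r_aperture=1.6, v_cut=2500.0):
    """Keep (R, v_los) pairs inside the aperture (Mpc/h) and velocity cut (km/s)."""
    return [(r, v) for r, v in galaxies if r <= r_aperture and abs(v) <= v_cut]


def pare_velocity_outliers(v_los, n_sigma=2.0, max_iter=100):
    """Iteratively remove velocities beyond n_sigma * sigma_v until nothing changes."""
    kept = list(v_los)
    for _ in range(max_iter):
        if len(kept) < 2:
            break
        center = statistics.fmean(kept)
        sigma = statistics.stdev(kept)
        survivors = [v for v in kept if abs(v - center) <= n_sigma * sigma]
        if len(survivors) == len(kept):
            break  # converged: no galaxy was removed in this pass
        kept = survivors
    return kept


def keep_cluster(v_los_pared):
    """Apply the minimum-membership requirement to a pared cluster."""
    return len(v_los_pared) >= MIN_MEMBERS
```

A cluster would pass through `cylinder_cut`, then `pare_velocity_outliers` on the surviving velocities, and would be retained only if `keep_cluster` returns True. Interlopers that lie inside the cylinder and within $2\sigma_v$ survive this procedure, which is the source of the impurity discussed in the text.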

Cluster masses are predicted in two ways: in the PL
approach, a standard $M(\sigma_v)$ power law is used to train and
test, while in the ML approach, SDM is utilized. Four
feature sets are considered with SDM: MLv (absolute
value of the line-of-sight velocity, $|v_{\rm los}|$), MLr (galaxy
projected distance from the cluster center, $R$), MLv,r
($|v_{\rm los}|$ and $R$), and MLv,σ,r ($|v_{\rm los}|$, $|v_{\rm los}|/\sigma_v$, and $R$).
Results for halos with true mass $M > 3\times10^{14}\,M_\odot\,h^{-1}$ are
reported.

Our main conclusions can be summarized as follows: 

1. MLv and MLr (SDM with $|v_{\rm los}|$ feature only and
SDM with $R$ feature only, respectively) reduce the
error width by 69% compared to a power law applied to
the same Contaminated Catalog.

2. Further, though a simple cylindrical cut causes significant
scatter in the $M(\sigma_v)$ power law compared
to when the cluster membership is perfectly known,
both of these SDM methods outperform PLp, a power
law applied to a catalog with pure and complete
clusters. Compared to this ideal power law, MLv
and MLr each reduce the error width by 23%.

3. Though it reduces error width, MLv,r (SDM
with $|v_{\rm los}|$ and $R$) systematically underpredicts the
highest-mass clusters. It is identified as a disfavored
method.

4. MLv,σ,r (SDM with $|v_{\rm los}|$, $|v_{\rm los}|/\sigma_v$, and $R$)
minimizes the bias for the high-mass clusters ($M >
7\times10^{14}\,M_\odot\,h^{-1}$). It reduces the error width by 64% and
12% compared to PLc and PLp, respectively.

5. In some instances, a higher-than-expected $\sigma_v$
causes a catastrophic overprediction by method
PLc. The ML methods, however, predict reasonable
masses for even these outliers.

The SDM approach to determining cluster mass from
galaxy observables reduces errors by more than a factor
of two compared to a standard power-law scaling
approach applied to a cluster catalog with impure,
incomplete cluster membership information. SDM predicts
cluster masses well even when a traditional $M(\sigma_v)$
approach fails. Additionally, this technique works well even
with catalogs of impure and incomplete clusters created
with a simplistic cylindrical cut about the cluster center.
Ultimately, high-resolution, large-volume simulations are
needed for training before SDM can be applied to
observation. With such a simulation for training, the reduced
errors and more accurate predictions for impure,
incomplete, nonvirialized systems make SDM a valuable tool
for constraining cosmological models.

MultiDark database was developed in cooperation with

APPENDIX

Here, we explore how our choices of $R_{\rm aperture}$ and $v_{\rm cut}$ affect the PL and ML predictions and results. Two new
catalogs are prepared to correspond to a $3\times10^{14}\,M_\odot\,h^{-1}$ cluster ($R_{\rm aperture} = 1.1\,{\rm Mpc}\,h^{-1}$ and $v_{\rm cut} = 1570\,{\rm km\,s^{-1}}$,
denoted “Small Aperture”) and a $3\times10^{15}\,M_\odot\,h^{-1}$ cluster ($R_{\rm aperture} = 2.3\,{\rm Mpc}\,h^{-1}$ and $v_{\rm cut} = 3785\,{\rm km\,s^{-1}}$, denoted
“Large Aperture”). The Contaminated Catalog used in the main body of this work has been renamed “Medium
Aperture” for clarity. As before, a $2\sigma$ iterative paring scheme is applied to the initial cylindrical cut. With the
exception of the $R_{\rm aperture}$ and $v_{\rm cut}$ values, the catalog-construction methods described earlier are followed. These catalogs, along with
the Pure Catalog, are summarized in Table 4.

TABLE 4
Catalog Summary

Catalog Name      Type       R_aperture (Mpc h^-1)   v_cut (km s^-1)   σ_15 (km s^-1)   α
Small Aperture    PL Train   1.1                     1570              569              0.209
Medium Aperture   PL Train   1.6                     2500              895              0.384
Large Aperture    PL Train   2.3                     3785              900              0.400
Pure              Train      —                       —                 1244             0.382

Fig. 10.— Left: Small Aperture Catalog’s LOS velocity dispersion of galaxies, $\sigma_v$, vs. cluster mass, $M$, shown as a 2D histogram. Only
clusters above $3\times10^{14}\,M_\odot\,h^{-1}$ (black dash dotted) are used to determine the best fit power law (black solid); the small aperture and
$v_{\rm cut}$ lead to smaller-than-expected $\sigma_v$ values for the high-mass halos and result in a shallow fit. The $M(\sigma_v)$ fit for pure and complete clusters
(PLp, red) is overlaid for reference. Center: Medium Aperture Catalog. If the lognormal scatter in $\sigma_v$ were consistent across the entire
mass range, the 1- and 2-$\sigma$ errors (black dashed and dotted, respectively) calculated at the high-mass end would describe the scatter in $\sigma_v$
even at low masses. However, a clear trend emerges, with increased scatter in $\sigma_v$ at lower masses. Right: Large Aperture Catalog. The
slope of the power law has steepened. This is due to the larger $R_{\rm aperture}$ and $v_{\rm cut}$ used for this catalog, which capture more true members
of the high-mass clusters, allowing these objects to be more accurately described. Though the high-mass clusters are now well-represented
by their measured $\sigma_v$, a clear second population emerges at low mass and high $\sigma_v$, with 20% of halos with $M < 3\times10^{14}\,M_\odot\,h^{-1}$ lying
above the 2-$\sigma$ dotted line.

Figure 10 shows how the choice of $R_{\rm aperture}$ and $v_{\rm cut}$ affects the power-law fits. This two-dimensional histogram of
$\sigma_v$ vs. $M$ shows that the best fit $\alpha$ and $\sigma_{15}$, as well as the scatter about the best fit line, change as a function of initial
cylinder size. Overlaid on the two-dimensional histogram is a best fit with 1- and 2-$\sigma$ lognormal errors, calculated for
clusters with mass above $3\times10^{14}\,M_\odot\,h^{-1}$ and extrapolated down to lower masses. Additionally overlaid is the best
fit power law for PLp.

When the Small Aperture cuts are applied, this overly small cylinder clips the $\sigma_v$ values at the high-mass end. This leads to
a shallow slope ($\alpha = 0.209$) and a small velocity dispersion associated with a $10^{15}\,M_\odot\,h^{-1}$ cluster ($\sigma_{15} = 569\,{\rm km\,s^{-1}}$). In
contrast, a large cylindrical cut increases scatter at the low-mass end. The resulting fit for the Large Aperture Catalog
is steep ($\alpha = 0.384$) and has a higher normalization ($\sigma_{15} = 895\,{\rm km\,s^{-1}}$) caused by the many high-$\sigma_v$ objects and the
substantial fraction of outliers above the 2-$\sigma$ line. These catalogs and fits are summarized in Table 4 for reference.
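
The power-law parameters discussed here can be illustrated with a short fitting sketch. It assumes the scaling relation has the form $\sigma_v = \sigma_{15}\,(M / 10^{15}\,M_\odot\,h^{-1})^{\alpha}$, which matches the way $\alpha$ and $\sigma_{15}$ are described in the text, and fits it by ordinary least squares on $\log\sigma_v$ versus $\log M$. The choice of regression direction and the function names are assumptions made for illustration, not a description of the fitting procedure used in this work.

```python
import math

M_PIVOT = 1.0e15  # pivot mass in M_sun / h


def fit_power_law(masses, sigmas):
    """Least-squares fit of log10(sigma_v) = log10(sigma_15) + alpha * log10(M / M_PIVOT)."""
    x = [math.log10(m / M_PIVOT) for m in masses]
    y = [math.log10(s) for s in sigmas]
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    alpha = sxy / sxx
    sigma_15 = 10.0 ** (y_mean - alpha * x_mean)
    return alpha, sigma_15


def predict_mass(sigma_v, alpha, sigma_15):
    """Invert the scaling relation to predict a mass from a velocity dispersion."""
    return M_PIVOT * (sigma_v / sigma_15) ** (1.0 / alpha)


# Noise-free toy data generated from the Pure Catalog parameters in Table 4.
toy_masses = [1.0e14, 3.0e14, 1.0e15, 3.0e15]
toy_sigmas = [1244.0 * (m / M_PIVOT) ** 0.382 for m in toy_masses]
alpha_fit, sigma15_fit = fit_power_law(toy_masses, toy_sigmas)
```

As a consistency check on the aperture choices, half of the Small Aperture velocity cut (1570 / 2 = 785 km/s) maps to a mass close to $3\times10^{14}\,M_\odot\,h^{-1}$ under the Pure Catalog parameters, via `predict_mass(785.0, 0.382, 1244.0)`.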

As the Large Aperture Catalog's cuts are used to probe lower masses, a bimodal distribution emerges with a second
population of clusters residing far above the best fit; this second population is visible in Figure. These high-$\sigma_v$,
low-mass objects increase scatter at the low-mass end. More worrisome, they have a velocity dispersion typically
associated with clusters roughly an order of magnitude larger in mass. Of halos with $M > 3 \times 10^{14}\,M_\odot\,h^{-1}$, 3%
reside above the $+2\sigma$ dotted line and 3% reside below the $-2\sigma$ dotted line. However, of halos with $1 \times 10^{14}\,M_\odot\,h^{-1} <
M < 3 \times 10^{14}\,M_\odot\,h^{-1}$, 20% reside above $+2\sigma$ and 3% below $-2\sigma$. The best fit and lognormal scatter found for the
higher-mass clusters in the Large Aperture Catalog are clearly not descriptive of the lower-mass clusters.
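The outlier fractions quoted above can be tallied as in the following sketch. It assumes the residual is the base-10 logarithmic offset of $\sigma_v$ from the power law, which is an assumption because the defining equation lies outside this section; the function name `outlier_fractions` and the synthetic catalog, including its artificially boosted low-mass subpopulation, are hypothetical.

```python
import numpy as np

def outlier_fractions(mass, sigma_v, alpha, sigma15, scatter,
                      pivot=1e15, n_sigma=2.0):
    """Fractions of clusters above +n_sigma and below -n_sigma of the fit.

    The residual is taken (as an assumption) to be
    log10(sigma_v) - log10(sigma_15 * (M / pivot)**alpha).
    """
    delta = np.log10(sigma_v) - np.log10(sigma15 * (mass / pivot) ** alpha)
    above = np.mean(delta > n_sigma * scatter)
    below = np.mean(delta < -n_sigma * scatter)
    return above, below

# Synthetic catalog (illustrative values only, not taken from the paper).
rng = np.random.default_rng(1)
n = 5000
mass = 10.0 ** rng.uniform(14.0, 15.0, size=n)
alpha, sigma15, scatter = 0.35, 1000.0, 0.05
log_sv = (np.log10(sigma15) + alpha * np.log10(mass / 1e15)
          + rng.normal(0.0, scatter, size=n))

# Mimic interloper-inflated dispersions: boost a subset of low-mass halos.
low = mass < 3e14
boosted = low & (rng.random(n) < 0.2)
log_sv[boosted] += 0.3
sigma_v = 10.0 ** log_sv

hi_above, hi_below = outlier_fractions(mass[~low], sigma_v[~low],
                                       alpha, sigma15, scatter)
lo_above, lo_below = outlier_fractions(mass[low], sigma_v[low],
                                       alpha, sigma15, scatter)
print(f"high mass: {hi_above:.3f} above, {hi_below:.3f} below")
print(f"low mass:  {lo_above:.3f} above, {lo_below:.3f} below")
```

For purely Gaussian scatter the two fractions are small and symmetric, so a strongly asymmetric excess above $+2\sigma$ in one mass bin is the signature of a second population.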

To further explore this outlier population, we will consider the residual, $\delta$ (Equation). Figure 11 shows that the
Large Aperture Catalog has a residual PDF that is adequately described by a single Gaussian, parameterized by

$$\mathrm{PDF} \propto \exp\left[-\frac{(\delta - \mu)^2}{2\sigma_{\rm gauss}^2}\right], \qquad (1)$$

with best fit width $\sigma_{\rm gauss} = 0.13$ and a nearly-zero offset, $\mu = 0.01$.

Fig. 11. Left: PDF of residual, $\delta$, for the Large Aperture Catalog. With a lower mass cut of $M = 3 \times 10^{14}\,M_\odot\,h^{-1}$, the PDF of
clusters' $\delta$ (thin black) is well-described by a single Gaussian (thick blue). Right: When the mass limit of the Large Aperture Catalog is
lowered to $M = 1 \times 10^{14}\,M_\odot\,h^{-1}$, the PDF is better described by a double Gaussian. Observational methods for identifying members of
this outlier population will be explored in a later work.

However, when the lower mass limit of this Large Aperture Catalog is decreased to $1 \times 10^{14}\,M_\odot\,h^{-1}$, the $\delta$ PDF
is better described by the sum of two Gaussians, as shown in Figure 11. The relative amplitude and width of the high-$\delta$
Gaussian depend on the minimum mass cut applied to the catalog, and our choice of $1 \times 10^{14}\,M_\odot\,h^{-1}$ is for
illustrative purposes only. Note, however, that the zero-centered Gaussian has $\sigma_{\rm gauss} = 0.16$ and $\mu = 0.03$, comparable
to the single Gaussian fit found previously. This suggests that a single lognormal scatter describes the population
that is well-characterized by the $M(\sigma_v)$ power law, while a second population of high-$\sigma_v$ outliers emerges at low masses.
Exploring observational methods for describing and identifying members of this outlier population will be considered
in future work.
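One way to separate a zero-centered component from a high-$\delta$ component is sketched below. It uses expectation-maximization on the unbinned residuals, which is a swapped-in technique chosen for compactness and not necessarily how the Gaussians here were fit; the function names and the synthetic residuals are hypothetical.

```python
import numpy as np

def gauss(x, mu, sig):
    """Normalized Gaussian density."""
    return np.exp(-0.5 * ((x - mu) / sig) ** 2) / (sig * np.sqrt(2.0 * np.pi))

def fit_two_gaussians(delta, n_iter=200):
    """EM fit of a two-component 1-D Gaussian mixture.

    Returns (weights, means, widths), each an array of length 2, with
    component 0 started at the bulk and component 1 in the upper tail.
    """
    delta = np.asarray(delta, dtype=float)
    mu = np.array([np.median(delta), np.percentile(delta, 95)])
    sig = np.array([delta.std(), delta.std()])
    w = np.array([0.8, 0.2])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        dens = w[:, None] * gauss(delta[None, :], mu[:, None], sig[:, None])
        resp = dens / dens.sum(axis=0)
        # M-step: update weights, means, and widths from the responsibilities.
        nk = resp.sum(axis=1)
        w = nk / delta.size
        mu = (resp * delta).sum(axis=1) / nk
        sig = np.sqrt((resp * (delta - mu[:, None]) ** 2).sum(axis=1) / nk)
    return w, mu, sig

# Synthetic residuals (illustrative values only, not taken from the paper).
rng = np.random.default_rng(2)
core = rng.normal(0.0, 0.15, size=4000)     # population following the power law
tail = rng.normal(0.6, 0.15, size=1000)     # high-delta outlier population
w, mu, sig = fit_two_gaussians(np.concatenate([core, tail]))
print("weights:", w, "means:", mu, "widths:", sig)
```

Comparing the width and offset of the zero-centered component with a single-Gaussian fit to the high-mass clusters alone is the kind of consistency check described in the text.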

Figure 12 shows that the resulting large scatter produces a PLc error PDF that is wide and flat as before, with the
shape of the PLc PDF dependent on the cylindrical cut parameters. For the Small Aperture Catalog, the shallow fit
coupled with the large number of clusters with large negative $\delta$ contributes to the substantial population of clusters
being underestimated by an order of magnitude or more ($\epsilon < -0.1$). SDM produces a slightly wider error distribution
for this small initial cylinder compared to the Medium Aperture cuts, though still reducing $\Delta\epsilon$ compared to both PLc
and PLp. Distributions of error as a function of mass are comparable to those seen in Figure, regardless of the
training catalog, though $\epsilon$ tends to decrease and $\Delta\epsilon$ tends to widen for small initial cylinders.

As before, applying the PLc scaling relation to the Small Aperture Catalog also produces a number of catastrophically
overpredicted clusters. These overpredicted objects are identified by their residual relative to the lognormal
scatter: $\delta > 1.5 \times \sigma_{\rm gauss}$. The shallow slope makes the overprediction much more pronounced. However, Figure 12
shows that, even in this case, SDM predicts reasonably accurate masses for these objects.

The population of high-$\sigma_v$, low-mass, high-$\delta$ objects in the Large Aperture Catalog similarly produces a substantial
number of catastrophically overpredicted clusters. These large-$\epsilon$ objects, shown in Figure 12, are also well-predicted
by SDM. While the PLc gives a large range of errors, SDM can more accurately predict these cluster masses despite
overly large or small cylindrical cuts that contribute to significant impurity or incompleteness in the mock clusters.

ML$_v$ and ML$_{v,\sigma,r}$ produce the smallest $\Delta\epsilon$ when the initial cylinders are large, with $\Delta\epsilon = 0.670$ and 0.763,
respectively, for the Medium Aperture Catalog and $\Delta\epsilon = 0.660$ and 0.752 for the Large Aperture Catalog. The Small
Aperture Catalog error distribution is wider: $\Delta\epsilon = 0.809$ and 0.898. However, in all cases except ML$_{v,\sigma,r}$ applied to
the Small Aperture cylinder, the width of the error distribution is narrower than that of the Pure Catalog power law, which has
$\Delta\epsilon = 0.871$. SDM performs better with impurity than with incompleteness, with larger cylinders producing slightly more
accurate mass predictions.
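To make the comparison of widths concrete, the sketch below computes a fractional mass error and one possible width statistic. Taking $\Delta\epsilon$ to be the range between the 16th and 84th percentiles of $\epsilon$ is an assumption adopted for illustration, since this section only describes $\Delta\epsilon$ as the width of the error distribution; the function names and synthetic predictions are likewise hypothetical.

```python
import numpy as np

def fractional_error(m_pred, m_true):
    """Fractional mass error, epsilon = (M_pred - M_true) / M_true."""
    m_pred = np.asarray(m_pred, dtype=float)
    m_true = np.asarray(m_true, dtype=float)
    return (m_pred - m_true) / m_true

def error_width(eps, lo=16.0, hi=84.0):
    """Width of the error distribution as a percentile range (assumed definition)."""
    p_lo, p_hi = np.percentile(eps, [lo, hi])
    return p_hi - p_lo

# Synthetic predictions (illustrative values only, not taken from the paper):
# one method with 0.1 dex of lognormal scatter, another with 0.3 dex.
rng = np.random.default_rng(3)
m_true = 10.0 ** rng.uniform(14.0, 15.0, size=5000)
m_pred_narrow = m_true * 10.0 ** rng.normal(0.0, 0.1, size=m_true.size)
m_pred_wide = m_true * 10.0 ** rng.normal(0.0, 0.3, size=m_true.size)

width_narrow = error_width(fractional_error(m_pred_narrow, m_true))
width_wide = error_width(fractional_error(m_pred_wide, m_true))
print(f"narrow: {width_narrow:.3f}, wide: {width_wide:.3f}")
```

Because $\epsilon$ is bounded below by $-1$ but unbounded above, a percentile range is less sensitive to catastrophic overpredictions than a standard deviation would be.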

Errors produced by a power-law scaling relation are clearly dependent on the choices of $R_{\rm aperture}$ and $v_{\rm cut}$, sometimes
catastrophically overpredicting cluster masses. Though standard power-law scaling fits and error distributions are
sensitive to choices in cuts, SDM can predict accurately under a wide range of scenarios, provided the training and
test data have the same imposed cuts.

Fig. 12. Top Left: PDF of errors for the Small Aperture Catalog. When this small cut is imposed on the mock observation, the shallow
slope of the fit causes the large-negative-$\delta$ population to be underpredicted in mass by an order of magnitude or more, creating the abundance
of clusters with $\epsilon < 0.1$. Top Center: Small Aperture Catalog, high mass halos only ($M > 7 \times 10^{14}\,M_\odot\,h^{-1}$), has a similar abundance of
underpredicted halo masses. Top Right: PDF of errors for the high-error objects. The shallow Small Aperture fit also results in a number
of catastrophically overpredicted clusters. SDM, however, predicts reasonable masses for even these outliers. Bottom Left: PDF of errors
for the Large Aperture Catalog. The large cut leads to more interlopers, but SDM predicts better than a scaling relation applied to a pure
and complete catalog. Bottom Center: Large Aperture Catalog, high mass halos only. Bottom Right: PDF of high-error objects for the
Large Aperture Catalog. SDM predicts reasonably accurate masses here, though a power-law scaling relation fails catastrophically.









